{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7bUMANd9dKsQ"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_06_1_transformers.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nwue4RDMdKsS"
   },
   "source": [
    "# T81-558: Applications of Deep Neural Networks\n",
    "**Module 6: ChatGPT and Large Language Models**\n",
    "* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)\n",
    "* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iZEDDiENdKsS"
   },
   "source": [
    "# Module 6 Material\n",
    "\n",
    "* **Part 6.1: Introduction to Transformers** [[Video]](https://www.youtube.com/watch?v=mn6r5PYJcu0&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_06_1_transformers.ipynb)\n",
    "* Part 6.2: Accessing the ChatGPT API [[Video]](https://www.youtube.com/watch?v=tcdscXl4o5w&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_06_2_chat_gpt.ipynb)\n",
    "* Part 6.3: Llama, Alpaca, and LORA [[Video]](https://www.youtube.com/watch?v=oGQ3TQx1Qs8&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_06_3_alpaca_lora.ipynb)\n",
    "* Part 6.4: Introduction to Embeddings [[Video]](https://www.youtube.com/watch?v=e6kcs9Uj_ps&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_06_4_embedding.ipynb)\n",
    "* Part 6.5: Prompt Engineering [[Video]](https://www.youtube.com/watch?v=miTpIDR7k6c&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_06_5_prompt_engineering.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RrDcW1cUdKsT"
   },
   "source": [
    "# Google CoLab Instructions\n",
    "\n",
    "The following code ensures that Google CoLab is running the correct version of TensorFlow.\n",
    "  Running the following code will map your GDrive to ```/content/drive```."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "E97PfPCGdKsT",
    "outputId": "882f0b5f-005b-4055-e708-0b62433f67ad"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: not using Google CoLab\n"
     ]
    }
   ],
   "source": [
    "try:\n",
    "    from google.colab import drive\n",
    "    COLAB = True\n",
    "    print(\"Note: using Google CoLab\")\n",
    "except:\n",
    "    print(\"Note: not using Google CoLab\")\n",
    "    COLAB = False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NVW5XjandKsU"
   },
   "source": [
    "# Part 6.1: Introduction to Transformers\n",
    "\n",
    "Transformers are neural networks that provide state-of-the-art solutions for many of the problems previously assigned to recurrent neural networks. [[Cite:vaswani2017attention]](https://arxiv.org/abs/1706.03762) Sequences can form both the input and the output of a neural network, examples of such configurations include::\n",
    "\n",
    "* Vector to Sequence - Image captioning\n",
    "* Sequence to Vector - Sentiment analysis\n",
    "* Sequence to Sequence - Language translation \n",
    "\n",
    "Sequence-to-sequence allows an input sequence to produce an output sequence based on an input sequence. Transformers focus primarily on this sequence-to-sequence configuration.\n",
    "\n",
    "## High-Level Overview of Transformers\n",
    "\n",
    "This course focuses primarily on the application of deep neural networks. The focus will be on presenting data to a transformer and a transformer's major components. As a result, we will not focus on implementing a transformer at the lowest level. The following section provides an overview of critical internal parts of a transformer, such as residual connections and attention. In the next chapter, we will use transformers from [Hugging Face](https://huggingface.co/) to perform natural language processing with transformers. If you are interested in implementing a transformer from scratch, Keras provides a comprehensive [example](https://www.tensorflow.org/text/tutorials/transformer).\n",
    "\n",
    "Figure 10.TRANS-1 presents a high-level view of a transformer for language translation.\n",
    "\n",
    "**Figure 10.TRANS-1: High Level View of a Translation Transformer**\n",
    "![Transformer](https://data.heatonresearch.com/images/jupyter/transformer-1.jpg)\n",
    "\n",
    "We use a transformer that translates between English and Spanish for this example. We present the English sentence \"the cat likes milk\" and receive a Spanish translation of \"al gato le gusta la leche.\" \n",
    "\n",
    "We begin by placing the English source sentence between the beginning and ending tokens. This input can be of any length, and we presented it to the neural network as a ragged Tensor. Because the Tensor is ragged, no padding is necessary. Such input is acceptable for the attention layer that will receive the source sentence. The encoder transforms this ragged input into a hidden state containing a series of key-value pairs representing the knowledge in the source sentence. The encoder understands to read English and convert to a hidden state. The decoder understands how to output Spanish from this hidden state.\n",
    "\n",
    "We initially present the decoder with the hidden state and the starting token. The decoder will predict the probabilities of all words in its vocabulary. The word with the highest probability is the first word of the sentence. \n",
    "\n",
    "The highest probability word is attached concatenated to the translated sentence, initially containing only the beginning token. This process continues, growing the translated sentence in each iteration until the decoder predicts the ending token. \n",
    "\n",
    "## Transformer Hyperparameters\n",
    "\n",
    "Before we describe how these layers fit together, we must consider the following transformer hyperparameters, along with default settings from the Keras transformer example:\n",
    "\n",
    "* num_layers = 4\n",
    "* d_model = 128\n",
    "* dff = 512\n",
    "* num_heads = 8\n",
    "* dropout_rate = 0.1\n",
    "\n",
    "Multiple encoder and decoder layers can be present. The **num_layers** hyperparameter specifies how many encoder and decoder layers there are. The expected tensor shape for the input to the encoder layer is the same as the output produced; as a result, you can easily stack these layers.\n",
    "\n",
    "We will see embedding layers in the next chapter. However, you can think of an embedding layer as a dictionary for now. Each entry in the embedding corresponds to each word in a fixed-size vocabulary. Similar words should have similar vectors. The **d_model** hyperparameter specifies the size of the embedding vector. Though you will sometimes preload embeddings from a project such as [Word2vec](https://radimrehurek.com/gensim/models/word2vec.html) or [GloVe](https://nlp.stanford.edu/projects/glove/), the optimizer can train these embeddings with the rest of the transformer. Training your embeddings allows the **d_model** hyperparameter to set to any desired value. If you transfer the embeddings, you must set the **d_model** hyperparameter to the same value as the transferred embeddings.\n",
    "\n",
    "The **dff** hyperparameter specifies the size of the dense feedforward layers. The **num_heads** hyperparameter sets the number of attention layers heads. Finally, the dropout_rate specifies a dropout percentage to combat overfitting. We discussed dropout previously in this book.\n",
    "\n",
    "## Inside a Transformer\n",
    "\n",
    "In this section, we will examine the internals of a transformer so that you become familiar with essential concepts such as:\n",
    "\n",
    "* Embeddings\n",
    "* Positional Encoding\n",
    "* Attention and Self-Attention\n",
    "* Residual Connection\n",
    "\n",
    "You can see a lower-level diagram of a transformer in Figure 10.TRANS-2.\n",
    "\n",
    "**Figure 10.TRANS-2: Architectural Diagram from the Paper**\n",
    "![Attention is All you Need](https://data.heatonresearch.com/images/jupyter/transformer-2.jpg)\n",
    "\n",
    "While the original transformer paper is titled \"Attention is All you Need,\" attention isn't the only layer type you need. The transformer also contains dense layers. However, the title \"Attention and Dense Layers are All You Need\" isn't as catchy.\n",
    "\n",
    "The transformer begins by tokenizing the input English sentence. Tokens may or may not be words. Generally, familiar parts of words are tokenized and become building blocks of longer words. This tokenization allows common suffixes and prefixes to be understood independently of their stem word. Each token becomes a numeric index that the transformer uses to look up the vector. There are several special tokens:\n",
    "\n",
    "* Index 0 = Pad\n",
    "* Index 1 = Unknow\n",
    "* Index 2 = Start token\n",
    "* Index 3 = End token\n",
    "\n",
    "The transformer uses index 0 when we must pad unused space at the end of a tensor. Index 1 is for unknown words. The starting and ending tokens are provided by indexes 2 and 3.\n",
    "\n",
    "The token vectors are simply the inputs to the attention layers; there is no implied order or position. The transformer adds the slopes of a sine and cosine wave to the token vectors to encode position. \n",
    "\n",
    "Attention layers have three inputs: key (k), value(v), and query (q). This layer is self-attention if the query, key, and value are the same. The key and value pairs specify the information that the query operates upon. The attention layer learns what positions of data to focus upon.\n",
    "\n",
    "The transformer presents the position encoded embedding vectors to the first self-attention segment in the encoder layer. The output from the attention is normalized and ultimately becomes the hidden state after all encoder layers are processed. \n",
    "\n",
    "The hidden state is only calculated once per query. Once the input Spanish sentence becomes a hidden state, this value is presented repeatedly to the decoder until the decoder forms the final Spanish sentence.\n",
    "\n",
    "This section presented a high-level introduction to transformers. In the next part, we will implement the encoder and apply it to time series. In the following chapter, we will use [Hugging Face](https://huggingface.co/) transformers to perform natural language processing.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Module 6 Assignment\n",
    "\n",
    "You can find the fifth assignment here: [assignment 6](https://github.com/jeffheaton/app_deep_learning/blob/main/assignments/assignment_yourname_class6.ipynb)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "colab": {
   "collapsed_sections": [],
   "name": "new_t81_558_class_10_4_intro_transformers.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3.9 (torch)",
   "language": "python",
   "name": "pytorch"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
